Damien Ha ECE M148

Project 3 - Classify your own data

For this project we're going to explore some of the new topics covered since the last project, including Decision Trees and unsupervised learning. The final part of the project will ask you to perform your own data science project to classify a new dataset.

Submission Details

Project is due June 14th at 11:59 am (Wednesday afternoon). To submit the project, please save the notebook as a PDF file and submit the assignment via Gradescope. In addition, make sure that all figures are legible and sufficiently large. For best PDF results, we recommend installing LaTeX and printing the notebook using LaTeX.

Loading Essentials and Helper Functions

Example Project using new techniques

Since project 2, we have learned about a few new models for supervised learning (Decision Trees and Neural Networks) and unsupervised learning (Clustering and PCA). In this example portion, we will go over how to implement these techniques using the scikit-learn library.

Load and Process Example Project Data

For our example dataset, we will use the Breast Cancer Wisconsin Dataset to determine whether a mass found in a body is benign or malignant. Since this dataset was used as an example in project 2, you should be fairly familiar with it.

Feature Information:

Column 1: ID number

Column 2: Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

1. radius (mean of distances from center to points on the perimeter)
2. texture (standard deviation of gray-scale values)
3. perimeter
4. area
5. smoothness (local variation in radius lengths)
6. compactness (perimeter^2 / area - 1.0)
7. concavity (severity of concave portions of the contour)
8. concave points (number of concave portions of the contour)
9. symmetry
10. fractal dimension ("coastline approximation" - 1)

Due to the statistical nature of the test, we are not able to get exact measurements of the previous values. Instead, the dataset contains the mean and standard error of the real-valued features.

Columns 3-12 present the mean of the measured values

Columns 13-22 present the standard error of the measured values

Supervised Learning: Decision Tree

Classification with Decision Tree

Parameters for Decision Tree Classifier

In scikit-learn, the following are just some of the parameters we can pass into the Decision Tree Classifier:
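As a minimal sketch of how such a classifier might be constructed, assuming the breast cancer data introduced above (the specific parameter values here are illustrative, not the assignment's settings):

```python
# Illustrative sketch: a DecisionTreeClassifier with commonly tuned parameters.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(
    criterion="gini",     # impurity measure: "gini" or "entropy"
    max_depth=3,          # cap on tree depth, to limit overfitting
    min_samples_split=2,  # minimum samples required to split a node
    random_state=0,       # fix tie-breaking for reproducibility
)
clf.fit(X, y)
print(clf.get_depth())
```

Limiting `max_depth` trades training accuracy for a simpler, more interpretable tree.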

Visualizing Decision Trees

Scikit-learn allows us to visualize the decision tree to see which features it chose to split on and what the result is. Note that if the condition in a node is true, you traverse the left edge of the node; otherwise, you traverse the right edge.

We can even look at the tree in a textual format.
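A minimal sketch of the textual view, assuming a small tree fit on the breast cancer data (`sklearn.tree.plot_tree` gives the equivalent graphical view):

```python
# Illustrative sketch: inspecting a fitted tree textually with export_text.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(data.data, data.target)

# Each "|---" level is one split; the left branch is taken when the
# node's condition is true.
print(export_text(clf, feature_names=list(data.feature_names)))
```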

Feature Importance in Decision Trees

Decision Trees can also assign importance to features by measuring the average decrease in impurity (i.e., information gain) attributable to each feature. Features with higher decreases are treated as more important.

We can clearly see that "concave points_mean" has the largest importance because it provides the greatest reduction in impurity.
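A minimal sketch of reading these importances from a fitted tree; `feature_importances_` is normalized so the values sum to 1:

```python
# Illustrative sketch: ranking features by impurity-based importance.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Rank features from most to least important and show the top five.
order = np.argsort(clf.feature_importances_)[::-1]
for i in order[:5]:
    print(f"{data.feature_names[i]}: {clf.feature_importances_[i]:.3f}")
```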

Visualizing decision boundaries for Decision Trees

Similar to project 2, let's see what decision boundaries a Decision Tree creates. We use the two features most correlated with the target labels: concave_points_mean and perimeter_mean.

We can see that the model gets more and more complex with increasing depth until it converges somewhere in between depth 10 and 15.

Supervised Learning: Multi-Layer Perceptron (MLP)

A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. Neural networks are very powerful tools that are used in a variety of applications including image and speech processing. In class, we have discussed one of the earliest types of neural networks, known as a Multi-Layer Perceptron.


Using MLP for classification

Parameters for MLP Classifier

In scikit-learn, the following are just some of the parameters we can pass into the MLP Classifier:
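A minimal sketch of fitting an `MLPClassifier`, assuming the breast cancer data from earlier; the layer sizes and activation are illustrative choices, and features are scaled first since neural networks are sensitive to feature magnitudes:

```python
# Illustrative sketch: MLPClassifier inside a scaling pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(100, 100),  # two hidden layers of 100 neurons each
        activation="relu",              # also "tanh", "logistic", "identity"
        max_iter=800,                   # cap on training iterations
        random_state=0,
    ),
)
model.fit(X, y)
print(round(model.score(X, y), 3))
```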

Visualizing decision boundaries for MLP

Now, let's see how the decision boundaries change as a function of both the activation function and the number of hidden layers.

Unsupervised learning: PCA

As shown in lecture, PCA is a valuable dimensionality reduction tool that can extract a small subset of valuable features. In this section, we shall demonstrate how PCA can extract important visual features from pictures of subjects' faces. We shall use the AT&T Database of Faces. This dataset contains 40 different subjects with 10 samples per subject, which means we have a dataset of 400 samples.

We extract the images from the scikit-learn dataset library. The library imports the flattened images (faces.data), the 2-D image arrays (faces.images), and the subject each image belongs to (faces.target). Each image is a 64 by 64 image with pixels converted to floating point values in [0, 1].

Eigenfaces

The following code downloads and loads the face data.
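A minimal sketch of that loading step; `fetch_olivetti_faces` downloads the AT&T (Olivetti) data on first use and caches it locally:

```python
# Illustrative sketch: loading the AT&T / Olivetti faces via scikit-learn.
from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces(shuffle=True, random_state=0)
print(faces.data.shape)    # (400, 4096): 400 flattened 64x64 images
print(faces.images.shape)  # (400, 64, 64): the 2-D image arrays
print(faces.target.shape)  # (400,): subject id (0-39) for each image
```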

Now, let us see what features we can extract from these face images.

The following plots the top 30 PCA components along with how much variance each component explains.
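A minimal sketch of the underlying computation: fit PCA on the face data and read off the explained-variance ratios, where each row of `pca.components_` is one "eigenface" that reshapes back to a 64x64 image for display:

```python
# Illustrative sketch: PCA on face images and per-component variance.
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces()
pca = PCA(n_components=30).fit(faces.data)

for i, ratio in enumerate(pca.explained_variance_ratio_[:5]):
    print(f"component {i}: {ratio:.1%} of variance")

# One eigenface, reshaped for plotting with e.g. plt.imshow:
eigenface_0 = pca.components_[0].reshape(64, 64)
```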

Amazing! We can see that the model has learned to focus on many features that we as humans also look at when trying to identify a face, such as the nose, eyes, eyebrows, etc.

With this feature extraction, we can perform much more powerful learning.

Feature Extraction for Classification

Let's see if we can use PCA to improve the accuracy of the decision tree classifier.
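A minimal sketch of the comparison, assuming 30 components and a 75/25 split (both illustrative choices): fit PCA on the training split only, then train a tree on raw pixels versus PCA features.

```python
# Illustrative sketch: Decision Tree on raw pixels vs. on PCA features.
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

faces = fetch_olivetti_faces()
X_tr, X_te, y_tr, y_te = train_test_split(
    faces.data, faces.target, test_size=0.25, random_state=0,
    stratify=faces.target)

raw_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

pca = PCA(n_components=30).fit(X_tr)  # fit on training data only
tree = DecisionTreeClassifier(random_state=0).fit(pca.transform(X_tr), y_tr)
pca_acc = tree.score(pca.transform(X_te), y_te)
print(raw_acc, pca_acc)
```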

Clearly, we get much better accuracy for the model while using fewer features. But are the features PCA thought were important the same features that the decision tree used? Let's look at the feature importance of the tree. The following plot numbers the first principal component as 0, the second as 1, and so forth.

Amazingly, the first and second components were the most important features in the decision tree. Thus, we can claim that PCA has significantly improved the performance of our model.

Unsupervised learning: Clustering

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups. One major algorithm we learned in class is the K-Means algorithm.

Evaluating K-Means performance

While there are many ways to evaluate the performance of clustering algorithms, we will focus on the inertia score of the K-Means model. Inertia is another term for the sum of squared distances of samples to their closest cluster center.

Let us look at how the Inertia changes as a function of the number of clusters for an artificial dataset.
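A minimal sketch of that experiment, assuming a synthetic dataset from `make_blobs` with a known number of clusters (4); the resulting inertia curve is what gets plotted:

```python
# Illustrative sketch: inertia vs. number of clusters (elbow method).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to nearest center

# Inertia always shrinks as k grows; the drop slows sharply past the true k.
print([round(v) for v in inertias])
```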

From the plot, we can see that once the number of clusters in K-means reaches the correct number, inertia starts decreasing at a much slower rate. This creates a kind of elbow shape in the graph. For K-means clustering, the elbow method selects the number of clusters where the elbow forms. In this case, we see that this method would produce the correct number of clusters.

Let's try it on the cancer dataset.

Here we see that the elbow is not as cleanly defined. This may be because the dataset is not a good fit for K-means. Regardless, we can still apply the elbow method by noting that the slowdown happens around 7 to 14 clusters.

K-Means on Eigenfaces

Now, let's see how K-means performs in clustering the face data with PCA.

While the algorithm isn't perfect, we can see that K-means with PCA is picking up on some facial similarity or similar expressions.

(100 pts) Todo: Use new methods to classify heart disease

To compare how these new models perform with the other models discussed in the course, we will apply these new models on the heart disease dataset that was used in project 2.

Background: The Dataset (Recap)

For this exercise we will be using a subset of the UCI Heart Disease dataset, leveraging the fourteen most commonly used attributes. All identifying information about the patient has been scrubbed. You will be asked to classify whether a patient is suffering from heart disease based on a host of potential medical factors.

The dataset includes 14 columns. The information provided by each column is as follows:

Preprocess Data

This part is done for you since you would have already completed it in project 2. Use the train, target, test, and target_test for all future parts. We also provide the column names for each transformed column for future use.

The following shows the baseline accuracy of simply classifying every sample as the majority class.

(25 pts) Decision Trees

[5 pts] Apply Decision Tree on Train Data

Apply the decision tree on the train data with default parameters of the DecisionTreeClassifier. Report the accuracy and print the confusion matrix. Make sure to use random_state = 0 so that your results match ours.

[5 pts] Visualize the Decision Tree

Visualize the first two layers of the decision tree that you trained.

What is the gini index improvement of the first split?

Response: 0.496 - ((0.486 * 0.283) + (0.514 * 0.403)) = 0.15132

[5 pts] Plot the importance of each feature for the Decision Tree

How many features have non-zero importance for the Decision Tree? If we remove the features with zero importance, will it change the decision tree for the same sampled dataset?

Response: There are 16 features with non-zero importance, and removing the ones with zero importance will not change the decision tree for the same sampled data.

[10 pts] Optimize Decision Tree

While the default Decision Tree performs fairly well on the data, let's see if we can improve performance by optimizing the parameters.

Run a GridSearchCV with 3-Fold Cross Validation for the Decision Tree. Find the best model parameters amongst the following:

After using GridSearchCV, print the best model parameters and the best score.
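A minimal sketch of the search pattern, assuming the breast cancer data and a hypothetical parameter grid (the assignment's actual grid is not reproduced here):

```python
# Illustrative sketch: GridSearchCV with 3-fold CV over a Decision Tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

param_grid = {  # hypothetical grid, for illustration only
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_, round(search.best_score_, 3))   # best CV result
print(round(search.best_estimator_.score(X_te, y_te), 3))  # test accuracy
```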

Using the best model you have, report the test accuracy and print out the confusion matrix.

(20 pts) Multi-Layer Perceptron

[5 pts] Applying a Multi-Layer Perceptron

Apply the MLP on the train data with hidden_layer_sizes=(100,100) and max_iter = 800. Report the accuracy and print the confusion matrix. Make sure to set random_state=0.

[10 pts] Speedtest between Decision Tree and MLP

Let us compare the training times and prediction times of a Decision Tree and an MLP. Time how long it takes for a Decision Tree and an MLP to perform a .fit operation (i.e. training the model). Then, time how long it takes for a Decision Tree and an MLP to perform a .predict operation (i.e. predicting the testing data). Print out the timings and specify which model was quicker for each operation. We recommend using the time python module to time your code. An example of the time module was shown in project 2. Use the default Decision Tree Classifier and the MLP with the previously mentioned parameters.
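A minimal sketch of the timing pattern with the `time` module, assuming the breast cancer data as a stand-in; absolute numbers depend on the machine:

```python
# Illustrative sketch: timing .fit and .predict for a tree vs. an MLP.
import time

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def timed(fn):
    """Run fn() and return elapsed wall-clock seconds."""
    start = time.time()
    fn()
    return time.time() - start

tree = DecisionTreeClassifier(random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=800, random_state=0)

tree_fit = timed(lambda: tree.fit(X_tr, y_tr))
mlp_fit = timed(lambda: mlp.fit(X_tr, y_tr))
tree_pred = timed(lambda: tree.predict(X_te))
mlp_pred = timed(lambda: mlp.predict(X_te))

print(f"fit:     tree {tree_fit:.4f}s vs mlp {mlp_fit:.4f}s")
print(f"predict: tree {tree_pred:.4f}s vs mlp {mlp_pred:.4f}s")
```

Typically the tree trains far faster, since the MLP iterates gradient updates over many epochs.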

[5 pts] Compare and contrast Decision Trees and MLPs.

Describe at least one advantage and disadvantage of using an MLP over a Decision Tree.

Response: One advantage of an MLP over a Decision Tree is that it can capture complex non-linear relationships between features. A disadvantage is that its internal workings are complex, making it more difficult to interpret than a Decision Tree.

(35 pts) PCA

[5 pts] Transform the train data using PCA

Train a PCA model to project the train data on the top 10 components. Print out the 10 principal components. Look at the documentation of PCA for reference.

[5 pts] Percentage of variance explained by top 10 principal components

Using PCA's "explained_variance_ratio_" attribute, print the percentage of variance explained by the top 10 principal components.

[5 pts] Transform the train and test data into train_pca and test_pca using PCA

Note: Use fit_transform for train and transform for test
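A minimal sketch of that pattern, using the breast cancer data as an illustrative stand-in for the project's `train`/`test` splits: PCA learns its components from the training split only, then applies the same projection to the test split, avoiding leakage of test statistics.

```python
# Illustrative sketch: fit_transform on train, transform on test.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
train, test, target, target_test = train_test_split(X, y, random_state=0)

pca = PCA(n_components=10)
train_pca = pca.fit_transform(train)  # learns components from train only
test_pca = pca.transform(test)        # reuses those same components
print(train_pca.shape, test_pca.shape)
```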

[5 pts] PCA+Decision Tree

Train the default Decision Tree Classifier using train_pca. Report the accuracy using test_pca and print the confusion matrix.

Does the model perform better with or without PCA?

Response: The model seems to perform better with PCA

[5 pts] PCA+MLP

Train the MLP classifier with the same parameters as before using train_pca. Report the accuracy using test_pca and print the confusion matrix.

Does the model perform better with or without PCA?

Response: The model seems to perform better without PCA

[10 pts] Pros and Cons of PCA

In your own words, provide at least two pros and at least two cons for using PCA

Response: A pro of PCA is that it reduces the dimensionality of our data and lets us capture important patterns/trends without unnecessary additional information. Also, by transforming the original features into principal components, we might be able to discover hidden factors that explain our variance more easily. A con of PCA is the potential loss of information in the process of dimensionality reduction. Additionally, the principal components resulting from our use of PCA might not be directly interpretable.

(20 pts) K-Means Clustering

[5 pts] Apply K-means to the train data and print out the Inertia score

Use n_cluster = 5 and random_state = 0.

[10 pts] Find the optimal cluster size using the elbow method.

Use the elbow method to find the best cluster size or range of best cluster sizes for the train data. Check the cluster sizes from 2 to 20. Make sure to plot the Inertia and state where you think the elbow starts. Make sure to use random_state = 0.

The elbow is not exactly obvious, but the slow down seems to start around 10

[5 pts] Find the optimal cluster size for the train_pca data

Repeat the same experiment but use train_pca instead of train.

The elbow is once again not immediately obvious, but the slowdown seems to start around 15.

Notice that the inertia is much smaller for every cluster size when using PCA features. Why do you think this is happening? Hint: Think about what Inertia is calculating and consider the number of features that PCA outputs.

Response: PCA transforms a high-dimensional feature space into a lower-dimensional one. Inertia is the sum of squared distances between each data point and its assigned cluster centroid. With significantly fewer features, each squared distance is summed over far fewer dimensions, so the distances, and hence the inertia, are smaller for every cluster size.

(100 pts) Putting it all together

Through all the homeworks and projects, you have learned how to apply many different models to perform a supervised learning task. We are now asking you to take everything that you learned to create a model that can predict whether a hotel reservation will be canceled or not.

Context

Hotels see millions of people every year and always want to keep rooms occupied and paid for. Cancellations cost the business money, since it may be difficult to rent the room to another customer on short notice. As such, it is useful for a hotel to know whether a reservation is likely to be canceled. The following dataset provides a variety of information about a booking that you will use to predict whether that booking will be canceled.

PMS stands for Property Management System; the acronym appears in some of the attribute descriptions below.

Attribute Information

(C) is for Categorical

(N) is for Numeric

1) is_canceled (C) : Value indicating if the booking was canceled (1) or not (0).
2) hotel (C) : The dataset contains the booking information of two hotels: a resort hotel and a city hotel.
3) arrival_date_month (C): Month of arrival date with 12 categories: “January” to “December”
4) stays_in_weekend_nights (N): Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
5) stays_in_week_nights (N): Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
6) adults (N): Number of adults
7) children (N): Number of children
8) babies (N): Number of babies
9) meal (C): Type of meal
10) country (C): Country of origin.
11) previous_cancellations (N): Number of previous bookings that were canceled by the customer prior to the current booking
12) previous_bookings_not_canceled (N) : Number of previous bookings not canceled by the customer prior to the current booking
13) reserved_room_type (C): Code of room type reserved. Code is presented instead of designation for anonymity reasons
14) booking_changes (N) : Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
15) deposit_type (C) : No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay
16) days_in_waiting_list (N): Number of days the booking was in the waiting list before it was confirmed to the customer
17) customer_type (C): Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking
18) adr (N): Average Daily Rate (Calculated by dividing the sum of all lodging transactions by the total number of staying nights)
19) required_car_parking_spaces (N): Number of car parking spaces required by the customer
20) total_of_special_requests (N): Number of special requests made by the customer (e.g. twin bed or high floor)
21) name (C): Name of the Guest (Not Real)
22) email (C): Email (Not Real)
23) phone-number (C): Phone number (not real)

This dataset is quite large, with 86,989 samples, which makes it difficult to simply brute-force a large number of models. As such, you have to be thoughtful when designing your models.

The file name for the training data is "hotel_booking.csv".

Challenge

This project is about being able to predict whether a reservation is likely to cancel based on the input parameters available to us. We will ask you to perform some specific instructions to lead you in the right direction, but you are given free rein over which models to use and the preprocessing steps you take. We will ask you to write out a description of which models you chose and why you chose them.

(50 pts) Preprocessing

Preprocessing: For the dataset, the following are mandatory pre-processing steps for your data:

After writing your preprocessing code, write out a description of what you did for each step and provide a justification for your choices. All descriptions should be written in the markdown cells of the jupyter notebook. Make sure your writing is clear and professional.

We highly recommend reading through the scikit-learn documentation to make this part easier.

Preprocessing Steps:

One-Hot Encoding: I applied one-hot encoding to all categorical features in the dataset. This converts categorical variables into binary vectors, allowing the machine learning models to work effectively with the data. For features with multiple values, I kept a binary indicator for every category rather than dropping one, capturing all the available information without assuming any ordinal relationship between categories.

Dropping Fields: I dropped the fields that are not useful for the prediction task or contain information that is not relevant for the model. In this case, I dropped the fields 'name', 'email', and 'phone-number' as they do not contribute to predicting whether a reservation will be canceled or not.

Handling Missing Values: There are only a few missing values, so I chose to remove those samples.

Feature Scaling: I rescaled the real-valued features in the dataset to a similar range to avoid any particular feature dominating the learning process. I used the StandardScaler, which scales the features to have zero mean and unit variance.

Augmenting Features: I calculated the total number of nights stayed by summing the 'stays_in_weekend_nights' and 'stays_in_week_nights' features. This can capture the overall duration of the stay and potentially improve the model's predictive performance.

Train-Test Split: I split the preprocessed data into training and testing sets using an 80:20 ratio.

(50 pts) Try out a few models

Now that you have pre-processed your data, you are ready to try out different models.

For this part of the project, we want you to experiment with all the different models demonstrated in the course to determine which one performs best on the dataset.

You must perform classification using at least 3 of the following models:

Due to the size of the dataset, be careful which models you use and look at their documentation to see how you should tackle this size issue for each model.

For full credit, you must perform some hyperparameter optimization on your models of choice. You may find the following scikit-learn library on hyperparameter optimization useful.

For each model chosen, write a description of which models were chosen, which parameters you optimized, and which parameters you choose for your best model. While the previous part of the project asked you to pre-process the data in a specific manner, you may alter pre-processing step as you wish to adjust for your chosen classification models.

Logistic Regression is a linear model widely used for binary classification. It's computationally efficient and can handle large datasets. I optimized the C parameter, which represents the inverse of regularization strength. I tuned this parameter using cross-validation to find the optimal value that balances between overfitting and underfitting. The best model is selected based on the highest cross-validation accuracy.

Decision Tree is a non-linear model that learns decision rules to classify instances. It can handle large datasets but may overfit if not properly regularized. I optimized the max_depth parameter, which controls the maximum depth of the decision tree. I searched for the optimal value using cross-validation. The best model is selected based on the highest cross-validation accuracy.

KNN is a non-parametric classification algorithm that works by finding the k nearest neighbors to a given data point and classifying it based on the majority class of those neighbors. To optimize the KNN model, I focused on the n_neighbors parameter, which specifies the number of neighbors to consider when making predictions. I created a parameter grid knn_params with different values for n_neighbors [3, 5, 7]. I used GridSearchCV to perform a grid search with 5-fold cross-validation (cv=5) to evaluate the different values of n_neighbors and select the best-performing model. After performing the grid search, I obtained the best model by accessing the best_estimator_ attribute of the GridSearchCV object.

MLP is a neural network that consists of multiple layers of nodes (neurons) and can learn complex non-linear relationships. To optimize the MLP model, I focused on hidden_layer_sizes and alpha:

hidden_layer_sizes: This parameter determines the number of neurons in each hidden layer of the MLP. I used (100, 100), meaning two hidden layers with 100 neurons each. The choice of the number of neurons and layers affects the model's capacity to learn complex patterns.

alpha: This parameter controls the regularization term of the MLP, which helps prevent overfitting. I considered the values 0.0001, 0.001, and 0.01. The regularization term helps to reduce the impact of large weights in the model, making it more generalizable.

To find the best combination of hyperparameters, I used GridSearchCV with 5-fold cross-validation (cv=5). This evaluates the model's performance using different combinations of hyperparameters and selects the one with the highest cross-validated accuracy.

Extra Credit

We have provided an extra test dataset named "hotel_booking_test.csv" that does not have the target labels. Classify the samples in the dataset with your best model and write them into a csv file. Submit your csv file to our Kaggle contest. The website will specify your classification accuracy on the test set. We will award a bonus point for the project for every percentage point over 75% that you get on your kaggle test accuracy.

To get the bonus points, you must also write out a summary of the model that you submit including any changes you made to the pre-processing steps. The summary must be written in a markdown cell of the jupyter notebook. Note that you should not change earlier parts of the project to complete the extra credit.

Kaggle Submission Instruction Submit a two column csv where the first column is named "ID" and is the row number. The second column is named "target" and is the classification for each sample. Make sure that the sample order is preserved.
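A minimal sketch of the file format, with a placeholder array standing in for model output; whether "ID" should start at 0 or 1 is not specified here, so adjust if the contest expects otherwise:

```python
# Illustrative sketch of the two-column submission file.
import pandas as pd

predictions = [0, 1, 1, 0]  # placeholder for best_model.predict(test_data)
submission = pd.DataFrame({
    "ID": range(len(predictions)),  # row number; sample order is preserved
    "target": predictions,          # predicted class for each sample
})
submission.to_csv("submission.csv", index=False)
print(submission.head())
```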

I filled the NA value in children with 0 because the mean of the children column is close to 0.

My best model was Multi-Layer Perceptron. MLP is a neural network that consists of multiple layers of nodes (neurons) and can learn complex non-linear relationships.

To optimize the MLP model, I did the same as above. To refresh, I focused on two main hyperparameters: hidden_layer_sizes and alpha.

hidden_layer_sizes: This parameter determines the number of neurons in each hidden layer of the MLP. I used (100, 100), meaning two hidden layers with 100 neurons each. The choice of the number of neurons and layers affects the model's capacity to learn complex patterns.

alpha: This parameter controls the regularization term of the MLP, which helps prevent overfitting. I considered the values 0.0001, 0.001, and 0.01. The regularization term helps to reduce the impact of large weights in the model, making it more generalizable.

To find the best combination of hyperparameters, I used GridSearchCV with 5-fold cross-validation (cv=5). This evaluates the model's performance using different combinations of hyperparameters and selects the one with the highest cross-validated accuracy.

I preprocessed the data using the same pipeline as earlier in the assignment. Here is the description of those preprocessing steps again:

Preprocessing Steps:

One-Hot Encoding: I applied one-hot encoding to all categorical features in the dataset. This converts categorical variables into binary vectors, allowing the machine learning models to work effectively with the data. For features with multiple values, I kept a binary indicator for every category rather than dropping one, capturing all the available information without assuming any ordinal relationship between categories.

Dropping Fields: I dropped the fields that are not useful for the prediction task or contain information that is not relevant for the model. In this case, I dropped the fields 'name', 'email', and 'phone-number' as they do not contribute to predicting whether a reservation will be canceled or not.

Handling Missing Values: There are only a few missing values, so I chose to remove those samples.

Feature Scaling: I rescaled the real-valued features in the dataset to a similar range to avoid any particular feature dominating the learning process. I used the StandardScaler, which scales the features to have zero mean and unit variance.

Augmenting Features: I calculated the total number of nights stayed by summing the 'stays_in_weekend_nights' and 'stays_in_week_nights' features. This can capture the overall duration of the stay and potentially improve the model's predictive performance.

Train-Test Split: I split the preprocessed data into training and testing sets using an 80:20 ratio.

The following was an earlier best model; it is no longer my best model, but I have kept it here for reference.

I filled the NA value in children with 0 because the mean of the children column is close to 0.

My best model was KNN. KNN is a non-parametric classification algorithm that works by finding the k nearest neighbors to a given data point and classifying it based on the majority class of those neighbors. To optimize the KNN model, I focused on the n_neighbors parameter, which specifies the number of neighbors to consider when making predictions. I created a parameter grid knn_params with different values for n_neighbors [1, 3, 5, 7]. I used GridSearchCV to perform a grid search with 5-fold cross-validation (cv=5) to evaluate the different values of n_neighbors and select the best-performing model. After performing the grid search, I obtained the best model by accessing the best_estimator_ attribute of the GridSearchCV object.